Using deep Gaussian processes with GPflux for Bayesian optimization.

[1]:
import numpy as np
import tensorflow as tf

For GPflux models we must use tf.keras.backend.set_floatx() to set the Keras backend float type to a value consistent with GPflow (GPflow defaults to float64). Otherwise, the code will crash with a ValueError!

[2]:
np.random.seed(1794)
tf.random.set_seed(1794)
tf.keras.backend.set_floatx("float64")

Describe the problem

In this notebook, we show how to use deep Gaussian processes (DGPs) for Bayesian optimization with Trieste and GPflux. DGPs may be better than standard GP surrogates for modeling non-stationary objective functions, as discussed in [DKvdH+17, HBB+19].

In this example, we seek the minimum value of the two- and five-dimensional Michalewicz functions over the hypercubes \([0, \pi]^2\) and \([0, \pi]^5\). We compare a two-layer DGP model with standard GP regression (GPR), using Thompson sampling for both.

The Michalewicz functions are highly non-stationary and have global minima that are hard to find, so DGPs may be more suitable than standard GPs, which typically use stationary kernels that cannot easily model non-stationarities.
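
For reference, the Michalewicz function has the standard form \(f(x) = -\sum_{i=1}^{d} \sin(x_i) \sin^{2m}(i x_i^2 / \pi)\) with steepness parameter \(m = 10\). Here is a minimal NumPy sketch of that standard definition, purely for reference; the experiments below use Trieste's built-in michalewicz_2 and michalewicz_5 objectives.

def michalewicz_reference(x, m=10):
    # x: array of shape [N, d]; returns Michalewicz values of shape [N]
    x = np.asarray(x)
    i = np.arange(1, x.shape[-1] + 1)  # dimension indices 1..d
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * m), axis=-1)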

[3]:
import gpflow
from trieste.objectives import (
    michalewicz_2,
    michalewicz_5,
    MICHALEWICZ_2_MINIMUM,
    MICHALEWICZ_5_MINIMUM,
    MICHALEWICZ_2_SEARCH_SPACE,
    MICHALEWICZ_5_SEARCH_SPACE
)
from trieste.objectives.utils import mk_observer
from util.plotting_plotly import plot_function_plotly

function = michalewicz_2
F_MINIMIZER = MICHALEWICZ_2_MINIMUM

search_space = MICHALEWICZ_2_SEARCH_SPACE

fig = plot_function_plotly(
    function,
    search_space.lower,
    search_space.upper,
    grid_density=100
)
fig.update_layout(height=800, width=800)
fig.show()

Sample the observer over the search space

We set up the observer as usual, using Sobol sampling to sample the initial points.

[4]:
import trieste

observer = mk_observer(function)

num_initial_points = 20
num_steps = 20
initial_query_points = search_space.sample_sobol(num_initial_points)
initial_data = observer(initial_query_points)

Model the objective function

The Bayesian optimization procedure estimates the next best points to query by using a probabilistic model of the objective. We'll use a two-layer deep Gaussian process (DGP) built using GPflux, and compare it to a (shallow) GP.

Since DGPs can be hard to build, Trieste provides some basic architectures: here we use the build_vanilla_deep_gp function, which returns a GPflux model of the DeepGP class. As with other models (e.g. GPflow ones), we cannot use it directly in Bayesian optimization routines; we need to pass it through an appropriate wrapper, in this case DeepGaussianProcess.

A few other notes on building a DGP model. The model requires us to specify the number of inducing points, since we don't have access to the true posterior. Training requires a stochastic optimizer; Adam is used by default, but other stochastic optimizers from TensorFlow can be used instead. GPflux lets us train via the Keras fit method, which makes optimization a lot easier; this method is used behind the scenes to train the model. For this problem we need to modify the default optimizer settings slightly, so we initialize a new optimizer wrapper instance (Optimizer) with custom minimization arguments minimize_args, which are passed on to Keras' fit method (see the Keras API documentation for a list of possible arguments).

[5]:
from gpflow.utilities import set_trainable

from trieste.models.gpflux import DeepGaussianProcess, build_vanilla_deep_gp
from trieste.models.optimizer import Optimizer


def build_dgp_model(data):
    variance = tf.math.reduce_variance(data.observations)

    dgp = build_vanilla_deep_gp(data.query_points, num_layers=2, num_inducing=100)
    dgp.f_layers[-1].kernel.kernel.variance.assign(variance)
    dgp.f_layers[-1].mean_function = gpflow.mean_functions.Constant()
    dgp.likelihood_layer.likelihood.variance.assign(1e-5)
    set_trainable(dgp.likelihood_layer.likelihood.variance, False)

    # These are just arguments for the Keras `fit` method.
    minimize_args = {
        "batch_size": 100,
        "epochs": 200,
        "verbose": 0,
    }
    optimizer = Optimizer(tf.optimizers.Adam(0.01), minimize_args)

    return DeepGaussianProcess(model=dgp, optimizer=optimizer)


dgp_model = build_dgp_model(initial_data)

Run the optimization loop

We can now run the Bayesian optimization loop by defining a BayesianOptimizer and calling its optimize method.

At each optimization step, the optimizer uses an acquisition rule to choose which points in the search space to evaluate next. We'll start by using Thompson sampling.
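
Conceptually, a single step of discrete Thompson sampling draws one joint sample from the model's posterior over a finite set of candidate points and picks the candidate that minimizes the sample. The sketch below illustrates the idea using the wrapped model's sample method; the DiscreteThompsonSampling rule used below handles all of this internally, so this snippet is purely illustrative.

candidates = search_space.sample(1000)            # finite set of candidate points
posterior_draw = dgp_model.sample(candidates, 1)  # one joint sample, shape [1, N, 1]
next_query_point = tf.gather(candidates, tf.argmin(posterior_draw[0, :, 0]))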

We’ll run the optimizer for twenty steps. Note: this may take a while!

[6]:
from trieste.acquisition.rule import DiscreteThompsonSampling

bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)
grid_size = 1000
acquisition_rule = DiscreteThompsonSampling(grid_size, 1)

# Note that the GPflux interface does not currently support using `track_state=True`. This will be
# addressed in a future update.
dgp_result = bo.optimize(num_steps, initial_data, dgp_model,
                         acquisition_rule=acquisition_rule, track_state=False)
dgp_dataset = dgp_result.try_get_final_dataset()
Optimization completed without errors

Explore the results

We can now get the best point found by the optimizer. Note that this isn't necessarily the point that was last evaluated.

[7]:
dgp_query_points = dgp_dataset.query_points.numpy()
dgp_observations = dgp_dataset.observations.numpy()

dgp_arg_min_idx = tf.squeeze(tf.argmin(dgp_observations, axis=0))

print(f"query point: {dgp_query_points[dgp_arg_min_idx, :]}")
print(f"observation: {dgp_observations[dgp_arg_min_idx, :]}")
query point: [2.18908725 1.56658573]
observation: [-1.79750852]

We can visualise how the optimizer performed with a three-dimensional plot.

[8]:
from util.plotting_plotly import add_bo_points_plotly

fig = plot_function_plotly(function, search_space.lower, search_space.upper, grid_density=100)
fig.update_layout(height=800, width=800)

fig = add_bo_points_plotly(
    x=dgp_query_points[:, 0],
    y=dgp_query_points[:, 1],
    z=dgp_observations[:, 0],
    num_init=num_initial_points,
    idx_best=dgp_arg_min_idx,
    fig=fig,
)
fig.show()

We can visualise the model over the objective function by plotting the mean and 95% confidence intervals of its predictive distribution. Note that the DGP model is able to model the local structure of the true objective function.
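
The 95% bands correspond to the predictive mean plus or minus roughly two predictive standard deviations. As a minimal sketch of computing such intervals by hand, assuming the wrapped model's standard predict method (the plotting utility below produces the equivalent visualisation internally):

final_dgp_model = dgp_result.try_get_final_model()
test_points = search_space.sample(5)              # a few arbitrary test locations
mean, var = final_dgp_model.predict(test_points)  # predictive mean and variance
lower = mean - 1.96 * tf.sqrt(var)
upper = mean + 1.96 * tf.sqrt(var)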

[9]:
import matplotlib.pyplot as plt
from util.plotting import plot_regret
from util.plotting_plotly import plot_dgp_plotly

fig = plot_dgp_plotly(
    dgp_result.try_get_final_model().model_gpflux,  # type: ignore
    search_space.lower,
    search_space.upper,
    grid_density=100,
)

fig = add_bo_points_plotly(
    x=dgp_query_points[:, 0],
    y=dgp_query_points[:, 1],
    z=dgp_observations[:, 0],
    num_init=num_initial_points,
    idx_best=dgp_arg_min_idx,
    fig=fig,
    figrow=1,
    figcol=1,
)
fig.update_layout(height=800, width=800)
fig.show()

We now compare to a GP model with priors over the hyperparameters. We do not expect this to do as well, because GP models with stationary kernels cannot deal well with non-stationary functions.

[10]:
import gpflow
import tensorflow_probability as tfp

from trieste.models.gpflow import GaussianProcessRegression


def build_gp_model(data):
    variance = tf.math.reduce_variance(data.observations)
    kernel = gpflow.kernels.Matern52(
        variance=variance,
        lengthscales=[0.2] * data.query_points.shape[-1],
    )
    prior_scale = tf.cast(1.0, dtype=tf.float64)
    kernel.variance.prior = tfp.distributions.LogNormal(
        tf.cast(-2.0, dtype=tf.float64), prior_scale
    )
    kernel.lengthscales.prior = tfp.distributions.LogNormal(
        tf.math.log(kernel.lengthscales), prior_scale
    )
    gpr = gpflow.models.GPR(
        data.astuple(),
        kernel,
        mean_function=gpflow.mean_functions.Constant(),
        noise_variance=1e-5,
    )
    gpflow.set_trainable(gpr.likelihood, False)

    return GaussianProcessRegression(gpr)


gp_model = build_gp_model(initial_data)

bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)

result = bo.optimize(num_steps, initial_data, gp_model, acquisition_rule=acquisition_rule,
                     track_state=False)
gp_dataset = result.try_get_final_dataset()

gp_query_points = gp_dataset.query_points.numpy()
gp_observations = gp_dataset.observations.numpy()

gp_arg_min_idx = tf.squeeze(tf.argmin(gp_observations, axis=0))

print(f"query point: {gp_query_points[gp_arg_min_idx, :]}")
print(f"observation: {gp_observations[gp_arg_min_idx, :]}")

from util.plotting_plotly import plot_gp_plotly

fig = plot_gp_plotly(
    result.try_get_final_model().model,  # type: ignore
    search_space.lower,
    search_space.upper,
    grid_density=100,
)

fig = add_bo_points_plotly(
    x=gp_query_points[:, 0],
    y=gp_query_points[:, 1],
    z=gp_observations[:, 0],
    num_init=num_initial_points,
    idx_best=gp_arg_min_idx,
    fig=fig,
    figrow=1,
    figcol=1,
)
fig.update_layout(height=800, width=800)
fig.show()
Optimization completed without errors
query point: [2.54378408 1.5652403 ]
observation: [-1.04530869]

We see that the DGP model does a much better job at understanding the structure of the function. The standard Gaussian process model has a large signal variance and small lengthscales, which do not result in a good model of the true objective. On the other hand, the DGP model is at least able to infer the local structure around the observations.
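
We can check this directly by inspecting the learned hyperparameters with GPflow's print_summary utility; this is just a quick diagnostic, and the exact values will differ from run to run.

from gpflow.utilities import print_summary

# Show the fitted GPR hyperparameters (kernel variance, lengthscales, noise variance)
print_summary(result.try_get_final_model().model)  # type: ignore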

We can also plot the regret curves of the two models side-by-side.

[11]:

gp_suboptimality = gp_observations - F_MINIMIZER.numpy()
dgp_suboptimality = dgp_observations - F_MINIMIZER.numpy()

_, ax = plt.subplots(1, 2)
plot_regret(dgp_suboptimality, ax[0], num_init=num_initial_points, idx_best=dgp_arg_min_idx)
plot_regret(gp_suboptimality, ax[1], num_init=num_initial_points, idx_best=gp_arg_min_idx)

ax[0].set_yscale("log")
ax[0].set_ylabel("Regret")
ax[0].set_ylim(0.5, 3)
ax[0].set_xlabel("# evaluations")
ax[0].set_title("DGP")

ax[1].set_title("GP")
ax[1].set_yscale("log")
ax[1].set_ylim(0.5, 3)
ax[1].set_xlabel("# evaluations")
[Figure: log-scale regret curves for the DGP (left) and GP (right) models]

We might also expect the DGP model to do better on higher-dimensional data. We explore this by testing a higher-dimensional version of the Michalewicz function.

Set up the problem.

[12]:

function = michalewicz_5
F_MINIMIZER = MICHALEWICZ_5_MINIMUM

search_space = MICHALEWICZ_5_SEARCH_SPACE

observer = mk_observer(function)

num_initial_points = 50
num_steps = 50
initial_query_points = search_space.sample_sobol(num_initial_points)
initial_data = observer(initial_query_points)

Build the DGP model and run the Bayesian optimization loop.

[13]:

dgp_model = build_dgp_model(initial_data)

bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)
acquisition_rule = DiscreteThompsonSampling(grid_size, 1)

dgp_result = bo.optimize(num_steps, initial_data, dgp_model,
                         acquisition_rule=acquisition_rule, track_state=False)
dgp_dataset = dgp_result.try_get_final_dataset()

dgp_query_points = dgp_dataset.query_points.numpy()
dgp_observations = dgp_dataset.observations.numpy()

dgp_arg_min_idx = tf.squeeze(tf.argmin(dgp_observations, axis=0))

print(f"query point: {dgp_query_points[dgp_arg_min_idx, :]}")
print(f"observation: {dgp_observations[dgp_arg_min_idx, :]}")

dgp_suboptimality = dgp_observations - F_MINIMIZER.numpy()
Optimization completed without errors
query point: [2.18519298 2.44722265 1.29701476 2.46369582 0.09738709]
observation: [-2.28221713]

Repeat the above for the GP model.

[14]:

gp_model = build_gp_model(initial_data)

bo = trieste.bayesian_optimizer.BayesianOptimizer(observer, search_space)

result = bo.optimize(num_steps, initial_data, gp_model, acquisition_rule=acquisition_rule,
                     track_state=False)
gp_dataset = result.try_get_final_dataset()

gp_query_points = gp_dataset.query_points.numpy()
gp_observations = gp_dataset.observations.numpy()

gp_arg_min_idx = tf.squeeze(tf.argmin(gp_observations, axis=0))

print(f"query point: {gp_query_points[gp_arg_min_idx, :]}")
print(f"observation: {gp_observations[gp_arg_min_idx, :]}")

gp_suboptimality = gp_observations - F_MINIMIZER.numpy()
Optimization completed without errors
query point: [2.28136499 2.73015024 1.46068527 2.48097424 0.18700185]
observation: [-1.81509633]

Plot the regret.

[15]:

_, ax = plt.subplots(1, 2)
plot_regret(dgp_suboptimality, ax[0], num_init=num_initial_points, idx_best=dgp_arg_min_idx)
plot_regret(gp_suboptimality, ax[1], num_init=num_initial_points, idx_best=gp_arg_min_idx)

ax[0].set_yscale("log")
ax[0].set_ylabel("Regret")
ax[0].set_ylim(1.5, 6)
ax[0].set_xlabel("# evaluations")
ax[0].set_title("DGP")

ax[1].set_title("GP")
ax[1].set_yscale("log")
ax[1].set_ylim(1.5, 6)
ax[1].set_xlabel("# evaluations")
[Figure: log-scale regret curves for the DGP (left) and GP (right) models on the five-dimensional problem]

While the DGP model is still far from the optimum, it gets considerably closer than the GP.
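
As a quick numerical check, using the suboptimality arrays computed above, we can compare the best value each model achieved (an optional snippet; the numbers will vary between runs):

print(f"DGP best suboptimality: {dgp_suboptimality.min():.4f}")
print(f"GP best suboptimality:  {gp_suboptimality.min():.4f}")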